There are things we know, things we know we don't know, and then there are things we don't
know we don't know. In this paper we address the latter two issues in a Bayesian framework, 
introducing the notion of doubt to quantify the degree of (dis)belief in a model given observational 
data in the absence of explicit alternative models. We demonstrate how a properly calibrated doubt 
can lead to model discovery when the true model is unknown. 



I. INTRODUCTION 

Given two or more competing models to describe 
observed data, Bayesian model comparison offers a 
way of determining the preferred model given the 
data and explicit assumptions about prior beliefs. 
The key feature of Bayesian model comparison is 
that it implements Occam's razor, by selecting the 
model that optimally balances quality-of-fit and 
high model predictivity. (See [1] for an introduction.)
Given a set of known models, however, the Bayesian
framework usually has little to say about the absolute
quality-of-fit of the preferred model. This is because
the underlying philosophy is that there is little
virtue in rejecting a model if no better alternative is
present.

In the frequentist approach, a popular (absolute)
measure of the goodness-of-fit is given by the
$\chi^2$-per-degree-of-freedom ($\chi^2$/dof) rule-of-thumb,
where $\chi^2$ is (minus twice) the best-fit log-likelihood.
For normally distributed data points, the best-fit $\chi^2$
follows a $\chi^2$ distribution with dof degrees of freedom.
Therefore, if the model is appropriate for the data, one
expects that $\chi^2/\mathrm{dof} \approx 1$. An unsatisfactory
fit is signaled by $\chi^2/\mathrm{dof} \gg 1$, while
$\chi^2/\mathrm{dof} \ll 1$ usually implies overfitting,
hence a model overspecification. Complementary
to this, principal component analysis (PCA) can be
used to determine the maximal number of parameters
a given observational data set can reasonably
constrain [2]. Diagonalizing the covariance matrix of the
parameters and determining how many eigenvalues
are below a given threshold gives an upper limit on
the number of parameters that are well-constrained
by observations, preventing the use of a too-general
model. The Bayesian framework replaces this with
the notion of model complexity, see [3].

In the Bayesian framework, the question of
whether the preferred model describes the observations
"well enough" can be phrased as follows: what
is the degree of belief that there are no other
"reasonable" models that would better describe the
observations? Historically, the need for fundamentally new
physics has often been driven by a poor fit between 
data and existing models, at a point where an explicit 
alternative was not available. For example, the devel- 
opment of General Relativity was driven in part by 
the need to explain a single number - the anomalous 
perihelion precession rate of Mercury. The increasing 
complexity of data makes it harder to simply eval- 
uate discrepancies between theory and experiment 
and decide on their significance. In light of the in- 
creasing usage of Bayesian statistical techniques, like 
Markov Chain Monte Carlo (MCMC) algorithms, it 
would be advantageous to develop a reliable mea- 
sure of confidence in the best-fit model that can deal 
with today's large data sets and multi-dimensional 
parameter spaces. This is particularly true in the 
cosmological context, which faces unique difficulties - 
some observations are now so advanced as to be con- 
strained by fundamental limitations on the quality 
of data (cosmic variance). Thus, cosmologists must 
take particular care to extract the maximum amount 
of information from available measurements. 

Our confidence in the (absolute) adequacy of the 
best model can only be determined under general 
assumptions about any hypothetical better fitting 
model. In this paper, we propose a set of assump- 
tions for such a model, define the notion of statistical 
doubt and illustrate its use by computing the doubt 
for a toy linear model. In a future work, we will apply 
this tool to evaluate the current concordance model 
of cosmology. 

First, we give a short review of Bayesian model 
selection. We introduce the notion of doubt, then 
discuss a technique for the calibration of the level of 
false doubt and demonstrate the usefulness of doubt 
for model discovery in an application to a toy linear 
model. Finally, we present our conclusions. 

II. BAYESIAN MODEL SELECTION

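In this section, we briefly review Bayesian model
selection; for more details we refer the reader to
the introduction cited in Section I. From Bayes'
theorem, the posterior probability of model
$\mathcal{M}_j$ given the data $d$, $p(\mathcal{M}_j|d)$, is related to
the Bayesian evidence (or model likelihood)
$p(d|\mathcal{M}_j)$ by

$$ p(\mathcal{M}_j|d) = \frac{p(d|\mathcal{M}_j)\,p(\mathcal{M}_j)}{p(d)}, \tag{1} $$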



where $p(\mathcal{M}_j)$ is the prior belief in model $\mathcal{M}_j$. Here
and in the following, "model" denotes a choice of
theory, with specification of its free parameters, $\theta_j$,
and of their prior probability distribution, $p(\theta_j|\mathcal{M}_j)$.
The specification of the prior might be somewhat
ambiguous for models with continuous free parameters,
especially when one is working with an effective
parameterisation only loosely tied to the underlying
physics. (For further discussion of these points,
see [4] and, for a critical view, [5].) In Eq. (1),
$p(d) = \sum_i p(d|\mathcal{M}_i)\,p(\mathcal{M}_i)$ is a normalisation constant
(where the sum runs over all available known models
$\mathcal{M}_i$, $i = 1, \dots, N$) and

$$ p(d|\mathcal{M}_j) = \int d\theta_j\, p(d|\theta_j,\mathcal{M}_j)\, p(\theta_j|\mathcal{M}_j) \tag{2} $$

is the Bayesian evidence, where $p(d|\theta_j,\mathcal{M}_j)$ is the
likelihood.

Given two competing models, $\mathcal{M}_0$ and $\mathcal{M}_1$, the
Bayes factor $B_{01}$ is the ratio of the models' evidences,

$$ B_{01} \equiv \frac{p(d|\mathcal{M}_0)}{p(d|\mathcal{M}_1)}, \tag{3} $$

where large values of $B_{01}$ denote a preference for
$\mathcal{M}_0$, and small values of $B_{01}$ denote a preference for
$\mathcal{M}_1$. The "Jeffreys' scale" (Table I) gives an empirical
prescription for translating the values of $B_{01}$ into
strengths of belief.

  $|\ln B_{01}|$    Odds         Strength of evidence
  < 1.0           < 3 : 1      Inconclusive
  1.0             ~ 3 : 1      Weak evidence
  2.5             ~ 12 : 1     Moderate evidence
  5.0             ~ 150 : 1    Strong evidence

TABLE I. Empirical scale for evaluating the strength of
evidence when comparing two models, $\mathcal{M}_0$ versus $\mathcal{M}_1$
(so-called "Jeffreys' scale", here slightly modified
following prescriptions given in the literature). The
right-most column gives our convention for denoting the
different levels of evidence above these thresholds.

Given two or more models, specified in terms of
their parameterisation and priors on the parameters,
it is straightforward (although sometimes
computationally challenging) to compute the Bayes factor.
Depending on the problem at hand, semi-analytical
and numerical (e.g. [10, 11]) techniques
are available. In the usual case where the prior
of the models is taken to be non-committal (i.e.,
$p(\mathcal{M}_j) = 1/N$), the model with the largest Bayes factor
ought to be preferred. Thus the computation of
$B_{01}$ allows one to select one (or a few) promising model(s)
from a set of known models. However, it contains
no information about whether the selected model is
actually a good explanation for the data. This
information is contained in $p(\mathcal{M}_j|d)$. From Eq. (1) it
is clear that a correct computation of $p(\mathcal{M}_j|d)$
requires the denominator $p(d)$ to be computed from a
reasonably complete sum of models.
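
As an illustration of Eqs. (1) and (3), the following is a minimal sketch that works with log-evidences, maps $|\ln B_{01}|$ onto the labels of Table I and normalises the posterior model probabilities over a closed list of known models. The function names and the input values in the usage line are hypothetical and chosen only for this example; how the evidences themselves are obtained is left open.

```python
import math

# Thresholds on |ln B01| taken from the empirical scale of Table I.
JEFFREYS_SCALE = [
    (5.0, "Strong evidence"),
    (2.5, "Moderate evidence"),
    (1.0, "Weak evidence"),
]


def ln_bayes_factor(ln_evidence_0, ln_evidence_1):
    """ln B01 = ln p(d|M0) - ln p(d|M1), i.e. Eq. (3) in log form."""
    return ln_evidence_0 - ln_evidence_1


def strength_of_evidence(ln_b01):
    """Map |ln B01| onto the labels of Table I."""
    for threshold, label in JEFFREYS_SCALE:
        if abs(ln_b01) >= threshold:
            return label
    return "Inconclusive"


def posterior_model_probabilities(ln_evidences, priors=None):
    """Eq. (1) over a closed list of known models (default p(M_j) = 1/N)."""
    n_models = len(ln_evidences)
    if priors is None:
        priors = [1.0 / n_models] * n_models
    shift = max(ln_evidences)  # subtract the maximum for numerical stability
    weights = [math.exp(le - shift) * p for le, p in zip(ln_evidences, priors)]
    total = sum(weights)
    return [w / total for w in weights]
```

For example, `strength_of_evidence(ln_bayes_factor(-10.0, -13.0))` classifies a log Bayes factor of 3 in favour of the first model. Note that the normalisation in `posterior_model_probabilities` runs over the known models only, which is exactly the limitation discussed above.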

We now turn to the question of how to evaluate 
our absolute degree of belief in the adequacy of a set 
of known models. 



III. BAYESIAN DOUBT 

A. Introducing doubt 

In light of the observations in the previous section,
we seek to capitalise on the information in $p(\mathcal{M}_j|d)$.
We introduce the concept of doubt $\mathcal{D}$ to describe in a
quantitative way our degree of (dis)belief in the ability
of any known model in a list $\mathcal{M}_i$ ($i = 1, \dots, N$) to
describe the data. We begin by expanding our space
of models to include an as-yet unknown model $X$,
which represents the possibility that the collection of
models presently under consideration is incomplete
and that there might be a "better" (in a Bayesian
sense) model that we have not yet identified. We
then define the doubt $\mathcal{D} \equiv \mathcal{D}(\{\mathcal{M}_i\}|d)$ as the
posterior probability of the unknown model, $p(X|d)$, which
from Bayes' theorem is given by

$$ \mathcal{D} \equiv p(X|d) = \frac{p(d|X)\,p(X)}{p(d)} = \left[1 + \frac{\sum_i p(d|\mathcal{M}_i)\,p(\mathcal{M}_i)}{p(d|X)\,p(X)}\right]^{-1}, \tag{4} $$

where the sum runs over the known models,
$i = 1, \dots, N$. The prior for the unknown model is

$$ p(X) = 1 - \sum_{i=1}^{N} p(\mathcal{M}_i). \tag{5} $$



Given some openness about the possibility that our
list of known models is incomplete, and given an
estimate of the Bayesian evidence $p(d|X)$ for $X$, the
doubt expresses the posterior probability that the
list of models $\mathcal{M}_i$ is missing a model that is a
better description of the available data. If $p(X) > 0$,
then "sufficiently poor evidence" for the known
models $\mathcal{M}_i$ (i.e., $p(d|\mathcal{M}_i) \ll p(d|X)$) will instill enough
doubt to question the appropriateness of $\mathcal{M}_i$.
Obviously, assuming a priori that the known models
exhaust the model space, i.e. $p(X) = 0$, would leave no
room for doubt: $\mathcal{D} = 0$ independent of the evidence
$p(d|X)$.
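
The definition in Eqs. (4) and (5) can be sketched in a few lines. This is a minimal illustration, assuming that log-evidences for the known models and an estimate of $\ln p(d|X)$ are already available; the function name and arguments are hypothetical.

```python
import math


def posterior_doubt(ln_evidences, model_priors, ln_evidence_x):
    """Posterior doubt D = p(X|d) of Eq. (4), with p(X) fixed by Eq. (5)."""
    prior_x = 1.0 - sum(model_priors)
    if prior_x <= 0.0:
        # The known models exhaust the model space: no room for doubt.
        return 0.0
    # sum_i p(d|M_i) p(M_i) / (p(d|X) p(X)), evaluated from log-evidences.
    # The exponent is capped to avoid floating-point overflow.
    ratio = sum(
        math.exp(min(le - ln_evidence_x, 700.0)) * p
        for le, p in zip(ln_evidences, model_priors)
    ) / prior_x
    return 1.0 / (1.0 + ratio)
```

With a single known model, prior 0.99 and equal evidences, the doubt stays at its prior value of 0.01; if the known model's evidence is much poorer than the estimate for $X$, the doubt approaches unity.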

The crucial step in evaluating the doubt is
estimating the evidence for the unknown model, $p(d|X)$.
Clearly this quantity cannot be computed using
Eq. (2), as this would require the unknown model
to be fully specified in terms of its parameters and
priors. If this were possible, then $X$ could be included
in the list of $\mathcal{M}_i$ and would not be unknown in the
first place.

Fortunately, even without an explicitly specified
model, but given the data $d$, we can produce an
informed guess as to what the evidence for a "good"
model should be. If the evidences of the models on
the table, $\mathcal{M}_i$, are poor compared to this value, then
the Bayes factors in favour of the unknown model satisfy

$$ \frac{p(d|X)}{p(d|\mathcal{M}_i)} \gg 1, \tag{6} $$

and consequently (see Eq. (4)) the posterior
probability of doubt will increase.

What we are suggesting is in fact a calibration of 
the absolute value of the evidence. Bayesian model 
comparison focuses on the Bayes factor, which in- 
dicates the change of our relative confidence in the 
models in light of the observed data. Since the Bayes 
factor is the ratio of the models' evidences, the abso- 
lute value of the evidence itself is usually deemed 
irrelevant. (This is only actually strictly true for 
nested models, where the normalisation of the ev- 
idence drops out of the ratio.) A shortcoming of 
ignoring the absolute value of the evidence is that 
the model comparison will always return a preferred 
model, even in cases when all of the available models 
fit the data poorly. The notion of doubt is designed 
to remedy this obviously unsatisfactory situation, by 
introducing a Bayesian way of dealing with the con- 
cept of absolute quality of fit. This is a familiar con- 
cept from the usual frequentist goodness-of-fit tests, 
which have the advantage of flagging strong discrep- 
ancies between the model and the observed data. In- 
tuitively, it is sensible that we should start doubting 
the adequacy of our model(s) whenever the observed 
data are in poor agreement with their predictions. 
An appropriately calibrated absolute value of the ev- 
idence can be employed within a Bayesian-style rea- 
soning to substantiate our intuition that "something 
fishy" must be going on whenever the data are a poor 
fit to the best model available. 



B. Calibration of the evidence 

The absolute upper bound on the value of the
evidence for the unknown model is achieved for a model
$\mathcal{S}$ which predicts exactly the data that have been
observed (and which has a prior that goes to zero for
any other observation). Such a model can be dubbed
a "sure-thing model", because it is totally
deterministic. However, in most situations of interest, such a
model is unrealistic, because it does not allow for
the statistical nature of the measurement process,
which is subject to noise, nor does it accommodate
a possible statistical connection between the
observables and the underlying physical model, which
introduces sample variance in certain contexts (e.g.
cosmic variance in cosmology). Furthermore (as has been
discussed in the literature for the conceptually simple
case of coin tossing), such models are usually thrown out
from the beginning, simply because there is a large number
of them: e.g. for any outcome of $N$ coin flips, there
are $2^N$ different sure-thing models $\mathcal{S}_i$ "predicting"
exactly the data that might have been observed.
Because of their large number, each of the $\mathcal{S}_i$ should be
penalised by a prior probability $p(\mathcal{S}_i) \sim 2^{-N}$, which
goes quickly to 0 for even moderate values of $N$. For
all those reasons, calibrating off the absolute
maximum value of the evidence is undesirable. A more
realistic calibration is required.

We suggest calibrating the evidence using the
properties of the likelihood and a default (weak)
reference prior. The first step is to approximate the
evidence for the unknown model, $p(d|X)$, via the
Bayesian Information Criterion (BIC),
the derivation of which we sketch below.

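Let us denote the likelihood of the unknown model
by $\mathcal{L}(\theta) = p(d|\theta,X)$ and the prior by $p(\theta|X)$. We
begin by Taylor expanding $g(\theta) \equiv \ln[\mathcal{L}(\theta)\,p(\theta|X)]$
around the maximum likelihood value, $\theta_{\max}$. To
second order,

$$ g(\theta) \approx g(\theta_{\max}) - \frac{1}{2}(\theta - \theta_{\max})^{T} H\, (\theta - \theta_{\max}), \tag{7} $$

where $H$ is minus the Hessian matrix evaluated at
the maximum likelihood point,

$$ H_{ab} \equiv -\left.\frac{\partial^2 g(\theta)}{\partial\theta_a\,\partial\theta_b}\right|_{\theta=\theta_{\max}}. \tag{8} $$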



Using this approximation in the calculation of the
evidence, Eq. (2), we obtain

$$ \ln p(d|X) = \ln\mathcal{L}_{\max} + \ln p(\theta_{\max}|X) + \frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln|H| + O(n^{-1}), \tag{9} $$



where $k$ is the number of parameters in the
unknown model. For large samples, we can approximate
$H \approx nI$ (to order $O(n^{-1/2})$), where $I$ is the
expected Fisher matrix from a single observation. We
now assume that the (unknown) prior $p(\theta|X)$ is a
multivariate Gaussian approximately centred at $\theta_{\max}$
with Fisher matrix $I$. This means that the assumed
prior distribution contains about the same (weak)
information as would an average single observation.
Then

$$ \ln p(\theta_{\max}|X) = -\frac{k}{2}\ln(2\pi) + \frac{1}{2}\ln|I|. \tag{10} $$

Plugging this reference prior into Eq. (9), and with the
above approximation for $H$, terms of order $O(1)$ cancel
and we obtain

$$ \ln p(d|X) = \ln\mathcal{L}_{\max} - \frac{k}{2}\ln n + O(n^{-1/2}). \tag{11} $$
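
The cancellation of the $O(1)$ terms can be made explicit. The following worked step is a sketch that uses only the approximation $H \approx nI$ for a $k \times k$ matrix and the Gaussian reference prior described above, and drops the higher-order corrections:

```latex
\begin{align}
\ln|H| &\approx \ln|nI| = k\ln n + \ln|I|, \\
\ln p(d|X) &\approx \ln\mathcal{L}_{\max}
  + \Big[-\tfrac{k}{2}\ln(2\pi) + \tfrac{1}{2}\ln|I|\Big]
  + \tfrac{k}{2}\ln(2\pi)
  - \tfrac{1}{2}\Big[k\ln n + \ln|I|\Big] \\
  &= \ln\mathcal{L}_{\max} - \tfrac{k}{2}\ln n .
\end{align}
```

The $\ln(2\pi)$ terms and the $\ln|I|$ terms cancel pairwise, so that only the best-fit likelihood and the $\ln n$ penalty on the number of parameters survive.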

This is the standard expression for the BIC, which
we will employ to estimate the evidence for the
unknown model. It requires an estimate of the best-fit
likelihood $\mathcal{L}_{\max}$ and of the number of free
parameters, $k$, for the unknown model $X$. Notice that the
likelihood, when normalised over the data space, is
a dimensionful quantity, with dimensions $[\mathrm{data}]^{-n}$.
In the following we will always drop such a prefactor
(and the associated factors of $2\pi$) as it always cancels
when considering evidence ratios (for the same data);
therefore $\mathcal{L}_{\max}$ has to be regarded as dimensionless.

In order to compute the evidence for the unknown
model from Eq. (11), we need to specify an estimator
$\hat{L}_{\max} \equiv -2\ln\hat{\mathcal{L}}_{\max}$ for (minus twice) the
best-fit log-likelihood, $-2\ln\mathcal{L}_{\max}$, of the unknown model
$X$. This can be obtained from the requirement of
"typicality" of the observed realization under $X$.
Assuming that the data are normally distributed,
$-2\ln\mathcal{L}_{\max}$ follows an approximate $\chi^2$ distribution
with $m = n - k$ degrees of freedom. Then the
expectation value of $-2\ln\mathcal{L}_{\max}$ (as taken over different
realizations of the data, represented by $\langle\cdot\rangle$) follows
from

$$ \langle -2\ln\mathcal{L}_{\max} \rangle = m. \tag{12} $$

Employing $\langle -2\ln\mathcal{L}_{\max} \rangle$ as an estimator for
$-2\ln\mathcal{L}_{\max}$ would be equivalent to assuming that the
unknown model has $\chi^2/\mathrm{dof} = 1$, in agreement with
the rule-of-thumb for goodness-of-fit tests. This,
however, is too harsh a requirement on the
performance of the known models $\mathcal{M}_i$. Even if one of the
known models is indeed the correct one, the realized
maximum likelihood value for that model will
be smaller than the estimator (i.e.
$-2\ln\mathcal{L}^{\mathcal{M}_i}_{\max} > \langle -2\ln\mathcal{L}_{\max} \rangle$)
in about 50% of realizations of the data
(for the median and the mean of the chi-square
distribution are very close). This would lead in many
cases to unjustified doubt of the correct model as
a consequence of harmless statistical fluctuations in
the observed data realization.

Therefore, instead of using the expectation value,
the value of $\hat{L}_{\max}$ should be more conservatively
estimated so that, for example, a correct known model
fits worse than the estimator,
$-2\ln\mathcal{L}^{\mathcal{M}_i}_{\max} > \hat{L}_{\max}$, in only
$100(1-\alpha)\%$ of the data realizations, where we are free
to choose the value of $\alpha$. This can be achieved by
taking $\hat{L}_{\max}$ to be the $\alpha$ quantile $\chi^2_m(\alpha)$ of the
chi-square distribution with $m$ dof, $P_{\chi^2_m}$, i.e.
$\hat{L}_{\max} = \chi^2_m(\alpha)$, defined through

$$ \int_{\chi^2_m(\alpha)}^{\infty} P_{\chi^2_m}(x)\,dx = 1 - \alpha. \tag{13} $$

As $\chi^2_m(\alpha)$ increases monotonically with $\alpha$, larger
values of $\alpha$ lead to smaller (and hence more
conservative) values for the evidence
$p(d|X) \propto \exp(-\chi^2_m(\alpha)/2)$ (see Eq. (11)). In principle,
one is free to choose the value of $\alpha$, and we calibrate
it by demanding the wrongful rejection rate of correct
models to be smaller than a given value $\gamma$ (see Tables II
and III below).

In summary, we suggest using as an estimator of
the unknown model's evidence

$$ \ln p(d|X,k,\alpha) = -\frac{\chi^2_{n-k}(\alpha)}{2} - \frac{k}{2}\ln n, \tag{14} $$

where on the left-hand side we have conditioned
explicitly on the number of parameters $k$ of $X$, and on
the quantile value $\alpha$.
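
Eq. (14) is straightforward to evaluate once a chi-square quantile is available. The sketch below stays within the Python standard library and therefore replaces the exact quantile by the Wilson-Hilferty approximation; that substitution is an assumption of this example rather than part of the method, and an exact quantile routine can be used instead. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def chi2_quantile_wh(alpha, m):
    """Approximate alpha-quantile of the chi-square distribution with m dof.

    Uses the Wilson-Hilferty cube-root normal approximation, which is
    accurate for moderate to large m.
    """
    z = NormalDist().inv_cdf(alpha)
    c = 2.0 / (9.0 * m)
    return m * (1.0 - c + z * math.sqrt(c)) ** 3


def ln_evidence_unknown(n, k, alpha):
    """Eq. (14): ln p(d|X,k,alpha) = -chi2_{n-k}(alpha)/2 - (k/2) ln n."""
    m = n - k
    return -0.5 * chi2_quantile_wh(alpha, m) - 0.5 * k * math.log(n)
```

For instance, `ln_evidence_unknown(100, 1, 0.99)` is smaller than `ln_evidence_unknown(100, 1, 0.95)`, which reflects the statement that larger values of $\alpha$ give a more conservative evidence for the unknown model.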

In a Bayesian spirit, one could treat $k$ and $\alpha$ as
hyperparameters, by specifying a prior and marginalising
over them in the evidence. Here, we investigate
the behaviour of doubt in a Gaussian linear toy
model, where $k$ is fixed at a plausible value and $\alpha$
is chosen by calibrating its value on the fraction of
cases where doubt wrongfully grows.

C. Change in the amount of doubt: 
independence of prior doubt 

The final quantity that needs to be specified in or- 
der to compute the posterior doubt is the prior doubt, 
p{X). It seems to us that values 10"^ <p[X) < 10" ^ 
might be plausible in many cases of interest, but 
higher or lower values are certainly possible. The
interesting question is in fact whether doubt increases
or decreases in the light of the observed data. We
assume that the prior probability of the known models
is equally split among them, i.e.

$$ p(\mathcal{M}_1) = p(\mathcal{M}_2) = \dots = p(\mathcal{M}_N) = \frac{1 - p(X)}{N}. \tag{15} $$

(This simplifying assumption can easily be relaxed.)
This leads to the following expression for the relative
change between the prior and posterior doubt:

$$ \mathcal{R} \equiv \frac{\mathcal{D}}{p(X)} = \frac{p(X|d,k,\alpha)}{p(X)} = \left[p(X) + (1 - p(X))\,\frac{\langle p(d|\mathcal{M}) \rangle}{e^{-\chi^2_m(\alpha)/2}\, n^{-k/2}}\right]^{-1}, \tag{16} $$

where we have defined the average known models'
evidence

$$ \langle p(d|\mathcal{M}) \rangle \equiv \frac{1}{N}\sum_{i=1}^{N} p(d|\mathcal{M}_i). \tag{17} $$

The doubt grows if $\mathcal{R} > 1$, i.e. for

$$ \frac{\langle p(d|\mathcal{M}) \rangle}{e^{-\chi^2_m(\alpha)/2}\, n^{-k/2}} < 1, \tag{18} $$

independently of the prior probability for doubt (as
long as this is strictly greater than zero).
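
The prior-independence of the threshold can be checked numerically. The following is a minimal sketch of Eq. (16) for equal priors on the known models, taking the log-evidence of the unknown model as an input; the function name and the numbers in the usage note are hypothetical.

```python
import math


def relative_doubt_change(ln_evidences, prior_x, ln_evidence_x):
    """R = D / p(X) of Eq. (16), for equal priors on the N known models."""
    n_models = len(ln_evidences)
    shift = max(ln_evidences)
    # ln of the average known-model evidence <p(d|M)>, Eq. (17).
    ln_mean = shift + math.log(
        sum(math.exp(le - shift) for le in ln_evidences) / n_models
    )
    # Ratio on the left-hand side of Eq. (18); the cap avoids overflow.
    ratio = math.exp(min(ln_mean - ln_evidence_x, 700.0))
    return 1.0 / (prior_x + (1.0 - prior_x) * ratio)
```

Evaluating `relative_doubt_change([-41.0], p, -40.0)` for very different prior doubts `p` always returns a value above 1, while `relative_doubt_change([-39.0], p, -40.0)` always returns a value below 1: the magnitude of the change depends on `p`, but its direction does not.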

Let us consider the asymptotic behaviour of the
criterion given by Eq. (18) for a large number of
data points, both under the assumption that the true
model is present in and the assumption that it is
absent from the list of known models.

If the true model $\mathcal{M}_T$ is within the list $\mathcal{M}_i$, then
by construction of the BIC

$$ \lim_{n\to\infty} p(d|\mathcal{M}_T) = e^{-\chi^2_m(0.5)/2}\, n^{-k/2}, \tag{19} $$

hence to leading order

$$ \frac{\langle p(d|\mathcal{M}) \rangle}{e^{-\chi^2_m(\alpha)/2}\, n^{-k/2}} \approx \frac{1}{N}\exp(\Delta_\alpha/2). \tag{20} $$

Here $\Delta_\alpha \equiv \chi^2_m(\alpha) - \chi^2_m(0.5)$, where $\chi^2_m(\alpha)$ is
the inverse of the cumulative distribution function (CDF)
of the chi-square distribution with $m$ degrees of freedom.
Clearly, $\Delta_\alpha > 0$ for $\alpha > 0.5$. In the limit of many
data points, $n \to \infty$ (or equivalently $m \to \infty$),
$\chi^2_m(\alpha)$ can be approximated by the inverse of the CDF
of the normal distribution with mean $m$ and variance $2m$,
$\mathcal{N}(m, 2m)$, so that

$$ \lim_{n\to\infty} \Delta_\alpha = \sqrt{2m}\,\sqrt{2}\;\mathrm{InverseErf}(2\alpha - 1) = 2\sqrt{m}\;\mathrm{InverseErf}(2\alpha - 1), \tag{21} $$

where $\mathrm{InverseErf}(x)$ is the inverse of the Error
Function. It follows that $\Delta_\alpha \gg 1$ and therefore, from
Eqs. (20) and (16), for many data points $n$ (or degrees of
freedom $m$), $\mathcal{R} \to 0$, and hence $\mathcal{D} \to 0$. In other
words: if the true model is in the list of known models,
the doubt goes to zero as expected. Notice that
in Eq. (20) the extra factor $1/N$ penalising the
true model comes from the fact that its predictivity
has been spread among a set of $N$ possibilities. More
precisely, the $1/N$ factor assumes that the evidences
for the other known models are negligible in the sum.
However, if there are $M < N$ other models which are
unnecessarily more complicated than the true model
with parameters that are unconstrained by the data,
one would expect that the evidence for each of those
models is of the same order as for the true model
(because the evidence does not penalise unconstrained
parameters). Therefore the $1/N$ factor would be
replaced by a factor $(M + 1)/N$.
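
The growth of $\Delta_\alpha$ with the number of degrees of freedom can be illustrated with the normal approximation $\mathcal{N}(m, 2m)$ quoted above. This is a sketch using the standard library's normal quantile function instead of an explicit inverse error function; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def delta_alpha_asymptotic(alpha, m):
    """Large-m limit of Delta_alpha = chi2_m(alpha) - chi2_m(0.5).

    The chi-square quantiles are replaced by those of a normal
    distribution with mean m and variance 2m.
    """
    gaussian = NormalDist(mu=m, sigma=math.sqrt(2.0 * m))
    return gaussian.inv_cdf(alpha) - gaussian.inv_cdf(0.5)


def leading_order_ratio(alpha, m, n_models=1):
    """Right-hand side of Eq. (20): exp(Delta_alpha / 2) / N."""
    return math.exp(0.5 * delta_alpha_asymptotic(alpha, m)) / n_models
```

For $\alpha = 0.95$ the difference scales as $\sqrt{m}$, so `leading_order_ratio(0.95, m)` grows rapidly with `m`, which is what drives $\mathcal{R}$ towards zero when the true model is in the list.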

If, instead, the true model (or another model that
is about as good as the true model in explaining the
data) is not within the list of known models, then
the numerator in Eq. (18) will drop very quickly to
zero, hence

$$ \lim_{n\to\infty} \mathcal{R} = \frac{1}{p(X)}. \tag{22} $$

Therefore the doubt goes to unity, $\mathcal{D} \to 1$, which
leads to questioning the completeness of our list of
known models.

Further modelling requires the explicit specification
of the known models and computation of $\mathcal{R}$ from
the observed data. We therefore proceed with an
illustration based on linear models.



IV. ILLUSTRATION: LINEAR TOY MODEL 

It is instructive to look at an example of the use
of doubt in a simple toy model. Consider the case of
a Gaussian linear model:

$$ y = A\theta + \epsilon, \tag{23} $$

where the dependent variable $y$ is an $n$-dimensional
vector of observations, $\theta$ is a vector of dimension $c$
of unknown regression coefficients and $A$ is an $n \times c$
matrix of known constants which specify the relation
between the input variables $\theta$ and the dependent
variables $y$. Furthermore, $\epsilon$ is an $n$-dimensional vector
of random variables with zero mean (the noise). We
assume that the observations are independent and
identically distributed (i.i.d.), hence $\epsilon$ follows a
multivariate Gaussian distribution with unit covariance and
the likelihood is given by

$$ \mathcal{L}(\theta) = \exp\left[-\frac{1}{2}\sum_{i=1}^{n}\frac{(y_i - y_i^{\rm th})^2}{\sigma^2}\right], \tag{24} $$

where $y_i$ ($y_i^{\rm th}$) are the values of the observed
(predicted) observables, and $\sigma = 1$.

For the purpose of our example, let us assume that 
we have n data points, and that two models are avail- 
able: 

• $\mathcal{M}_0$: $y = \theta$, i.e. $c = 1$ and $A = (1, \dots, 1)^t$, and



FIG. 1: Distribution of doubt values when the correct
model is used to fit the data, for 1000 realizations of 100 
data points, for prior doubt p{X) = 10~^ (blue, dashed) 
and p{X) = IQ-^ (red, solid), and a = 0.95. 



• $\mathcal{M}_1$: $y = \theta x$, i.e. $c = 1$ and $A = (x_1, \dots, x_n)^t$.

In both cases there is one free parameter, $\theta$, and
we will assume that a prior is available of the form
$p(\theta|\mathcal{M}_0) = p(\theta|\mathcal{M}_1) = 1/4$ for $\theta \in [-2, 2]$ (and
vanishes outside that range). We will assume that there
is one free parameter in the unknown model, i.e.
$k = 1$. (The number of effective free parameters can
be investigated further by means of the Bayesian
complexity, see [1].)

We are interested in investigating the behaviour of
the ratio of the posterior to the prior doubt, $\mathcal{R}$. We
expect $\mathcal{R} < 1$ (decreasing doubt) when the known
model is the correct underlying distribution, and
$\mathcal{R} > 1$ when an incorrect known model is used. For
definiteness, we will take the true model to be $\mathcal{M}_1$,
with $\theta_{\rm true} = 0.1$.
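
For these one-parameter models the evidence integral of Eq. (2) can be carried out in closed form, because the exponent of Eq. (24) is quadratic in $\theta$. The sketch below is an illustration under the stated flat prior and unit noise variance; the function name and its arguments are hypothetical, and the design vector `a` is set to all ones for $\mathcal{M}_0$ and to the $x_i$ for $\mathcal{M}_1$.

```python
import math
from statistics import NormalDist


def ln_evidence_linear(y, a, prior_low=-2.0, prior_high=2.0):
    """ln p(d|M) for y_i = theta * a_i + unit-variance Gaussian noise.

    A flat prior on theta over [prior_low, prior_high] is assumed and
    the likelihood normalisation is dropped, as in Eq. (24).
    """
    s_aa = sum(ai * ai for ai in a)
    s_ay = sum(ai * yi for ai, yi in zip(a, y))
    theta_hat = s_ay / s_aa  # best-fit (maximum likelihood) parameter
    chi2_min = sum((yi - theta_hat * ai) ** 2 for ai, yi in zip(a, y))
    # The exponent equals chi2_min + s_aa * (theta - theta_hat)^2, so the
    # theta integral is a Gaussian truncated to the prior range.
    root = math.sqrt(s_aa)
    std_normal = NormalDist()
    mass = (std_normal.cdf((prior_high - theta_hat) * root)
            - std_normal.cdf((prior_low - theta_hat) * root))
    prior_density = 1.0 / (prior_high - prior_low)
    return (-0.5 * chi2_min + math.log(prior_density)
            + 0.5 * math.log(2.0 * math.pi / s_aa) + math.log(mass))
```

The function assumes that the best-fit value lies close enough to the prior range for the truncated Gaussian mass to be non-zero; for data that are far outside the prior range a direct numerical integration would be a safer choice.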



A. Doubt when fitting data with the correct
model: false doubt



Let us assume that our list of known models
contains only $\mathcal{M}_1$ (i.e. $N = 1$ and $\mathcal{M}_0$ is not on the
list). We first fit the dataset (generated from $\mathcal{M}_1$)
with the correct model $\mathcal{M}_1$ and we compute the
posterior $p(\mathcal{M}_1|d)$ and the doubt $p(X|d)$ from Eq. (4),
using $\alpha = 0.95$. In this case, there should be no reason
for doubt as we expect $\mathcal{M}_1$ to be an adequate
description of the data. We show the ensuing distribution of
posterior doubt in Fig. 1 (from 1000 data realizations



with n = 100 data points each), for two different 
choices of prior doubt, p{X) ~ 10^^ (red/sohd his- 
togram) and p(A') = 10""'^ (blue/dashed histogram). 
The posterior doubt of the vast majority of the re- 
alizations is smaller than the prior doubt, consistent 
with expectations. Clearly, the absolute value of the 
posterior doubt depends on the choice of prior doubt, 
and quite reasonably so. If a priori one is quite cer- 
tain that the model is correct, then small deviations 
from a perfect fit will not shake one's belief in the 
model. However, if the prior doubt is relatively large, 
p{X) = 10~^ (i-e., if a priori one is quite uncertain 
that the model being used is correct) then already 
small fluctuations in the data will lead to relatively 
strong posterior doubt. The less confident one is to 
begin with, the more easily one's belief in a model is 
shaken by statistical fluctuations. In any case, larger 
amounts of data will lessen the effect of fluctuations, 
leading to little doubt about the correct model, with 
the amount of data required for persuasion dependent
on the prior doubt. For both values of prior
doubt in Fig. 1, about 14% of the realizations lead to
a posterior doubt that is larger than the prior doubt.
This number is independent of the prior doubt, as can
be seen from Eq. (16), and decreases with increasing
number of data points $n$ and as $\alpha$ approaches unity.
We call realizations that incorrectly give an increase
of doubt (although the model being used is the correct
one) cases of "false doubt".
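
A Monte Carlo estimate of the false-doubt fraction can be sketched as follows. Several ingredients are assumptions of this example and not taken from the text: the design points $x_i$ (here evenly spaced on $(0, 1]$), the random seed, and the use of the Wilson-Hilferty approximation for the chi-square quantile. Because the evidence of $\mathcal{M}_1$ depends on the choice of the $x_i$, the fraction returned by this sketch is not expected to reproduce the value quoted above.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def chi2_quantile_wh(alpha, m):
    """Wilson-Hilferty approximation to the chi-square alpha-quantile."""
    c = 2.0 / (9.0 * m)
    return m * (1.0 - c + STD_NORMAL.inv_cdf(alpha) * math.sqrt(c)) ** 3


def ln_evidence_m1(x, y):
    """Analytic ln-evidence of M1 (y = theta * x, prior 1/4 on [-2, 2])."""
    s_xx = sum(xi * xi for xi in x)
    theta_hat = sum(xi * yi for xi, yi in zip(x, y)) / s_xx
    chi2_min = sum((yi - theta_hat * xi) ** 2 for xi, yi in zip(x, y))
    root = math.sqrt(s_xx)
    mass = (STD_NORMAL.cdf((2.0 - theta_hat) * root)
            - STD_NORMAL.cdf((-2.0 - theta_hat) * root))
    return (-0.5 * chi2_min + math.log(0.25 * mass)
            + 0.5 * math.log(2.0 * math.pi / s_xx))


def false_doubt_fraction(n=100, alpha=0.95, k=1, theta_true=0.1,
                         realizations=1000, seed=0):
    """Fraction of realizations of the correct model with growing doubt."""
    rng = random.Random(seed)
    x = [(i + 1) / n for i in range(n)]  # hypothetical design points
    # Eq. (14): estimated ln-evidence of the unknown model.
    ln_ev_x = -0.5 * chi2_quantile_wh(alpha, n - k) - 0.5 * k * math.log(n)
    grew = 0
    for _ in range(realizations):
        y = [theta_true * xi + rng.gauss(0.0, 1.0) for xi in x]
        if ln_evidence_m1(x, y) < ln_ev_x:  # criterion of Eq. (18), N = 1
            grew += 1
    return grew / realizations
```

Since the criterion does not involve $p(X)$, the returned fraction is the same for any non-zero prior doubt. Calling the function with the same seed and a larger `alpha` can only lower the fraction, in line with the statement that false doubt decreases as $\alpha$ approaches unity.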

We now turn to investigate the relative change in
doubt, $\mathcal{R}$, which is plotted in Fig. 2 for two choices
of the number of data points $n$ and of the calibration
parameter $\alpha$. Recall that $\log\mathcal{R} > 0$ ($< 0$)
corresponds to an increase (decrease) in doubt in light of
the observed data. In fact, $\mathcal{R}$ can be regarded as a
sort of "Bayes factor" for doubt change: it gives
the relative change in our "state of doubt" after we
have seen the data. While the actual value of $\mathcal{R}$ is
dependent on the prior doubt, the threshold
separating increasing doubt from decreasing doubt (i.e.,
$\log\mathcal{R} = 0$) is independent of $p(X)$. As expected, a
larger number of data points leads to a decrease in
the fraction of realizations for which doubt
wrongfully grows (values $\log\mathcal{R} > 0$). The same is true if
one increases $\alpha$. As explained above, this is because
a larger value of $\alpha$ leads to a less harsh penalty for
odd features in the realized data.

By construction of the doubt, there is a strong
correlation between the value of $\log\mathcal{R}$ and the
chi-square per dof of the best fit. This is depicted in
Fig. 3. First, it is obvious that for a given choice
of $\alpha$, data realizations leading to a wrongful increase
in doubt ($\log\mathcal{R} > 0$) are the ones that present
"unlucky" features, i.e. the ones with a large value of
$\chi^2$/dof. In other words, cases of false doubt would
be suspicious even using a more traditional measure
of the quality of fit. However, the second, crucial

FIG. 2: Distribution of the change in doubt $\log\mathcal{R}$ when
fitting data with the correct model, from 1000 data
realizations for $n = 5$, $\alpha = 0.95$ (blue/solid, thin),
$n = 5$, $\alpha = 0.99$ (red/dashed, thin), $n = 100$, $\alpha = 0.95$
(blue/solid, thick) and $n = 100$, $\alpha = 0.99$ (red/dashed,
thick). Values $\log\mathcal{R} < 0$ correspond to a decrease in the
amount of doubt. A larger value of $\alpha$ and a larger number
of data points $n$ lead to a reduction of the number of cases
where the doubt wrongly grows (values $\log\mathcal{R} > 0$).



FIG. 3: Correlation between the "chi-square-per-dof"
rule and the change in doubt, $\log\mathcal{R}$, for different values
of the calibration parameter $\alpha$, increasing from right to
left (for 1000 data realizations). The parameter $\alpha$ can
be chosen so that only a pre-determined fraction $\gamma$ of
data realizations lie in the "false doubt" zone (shaded,
$\log\mathcal{R} > 0$). Here, $\alpha$ has been chosen in such a way that
(from right to left) $\gamma = 0.50, 0.05, 0.01$.



point is that the parameter $\alpha$ can be calibrated in
order to achieve a pre-determined fraction of false
doubt from a known model. By increasing $\alpha$, the
locus of the realizations shifts to the left of the plot.
Therefore one can choose $\alpha$ in such a way that the
probability of false doubt is below a given threshold.
This is discussed in the next section.



B. Calibration of the level of false doubt 

As shown above, some fraction of data realizations
will always lead to false doubt. This fraction depends
on the value of $\alpha$ and on the number of data points,
$n$, but not on the level of prior doubt. (There is a
further, if subdominant, dependence on $k$.)

For a given number of data points, it is desirable
to tune the value of $\alpha$ such that the fraction of
false doubt $\gamma$ is (on average) below a predetermined
threshold. This is achieved as follows. Starting from
the model distribution (here, $\mathcal{M}_1$), we employ current
data to derive constraints on its free parameters,
as usual in the inference step. We then select an
estimator $\hat{\theta}$ for the value of the parameters (here, $\theta$),
which will usually be either the best-fit point or the
posterior mean. We simulate 10^ realizations of the 



data from the model, assuming a fiducial value $\hat{\theta}$ for
its parameters. We then compute the doubt for each
realization, and calibrate the value of $\alpha$ by requiring
that the fraction of realizations with $\log\mathcal{R} > 0$
be below a value $\gamma$. To further reduce the scatter
in $\alpha$, we average over the resulting $\alpha$ values from
1000 such procedures. Table II shows such values of
$\alpha(\gamma, n, \mathcal{M}_1)$ for a few representative choices of $\gamma$ and
$n$. One striking feature of the calibration table is
that the value of $\alpha$ required for a false doubt rate $\gamma$
is systematically much larger than $1 - \gamma$. This is
another reflection of the well known fact that how likely
the data are given the hypothesis does not by itself
determine how probable the hypothesis is given the
data. Inferring the latter requires the use of Bayes
theorem. (For an in-depth discussion of this point, 
see 0). 
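
One way to implement the calibration step is sketched below. It rests on a rearrangement of the doubt-growth criterion that is an assumption of this example and not spelled out in the text: for a single known model, doubt grows when $\chi^2_{n-k}(\alpha) < t$ with $t \equiv -2\ln p(d|\mathcal{M}) - k\ln n$, so that $\alpha$ can be read off from an empirical quantile of $t$ over the simulated realizations. The chi-square CDF is approximated by the Wilson-Hilferty transform to stay within the standard library, and all names are hypothetical.

```python
import math
from statistics import NormalDist


def chi2_cdf_wh(x, m):
    """Approximate chi-square CDF with m dof (Wilson-Hilferty transform)."""
    if x <= 0.0:
        return 0.0
    c = 2.0 / (9.0 * m)
    z = ((x / m) ** (1.0 / 3.0) - (1.0 - c)) / math.sqrt(c)
    return NormalDist().cdf(z)


def calibrate_alpha(ln_evidences, gamma, n, k=1):
    """Smallest alpha for which at most a fraction gamma of the simulated
    realizations of the known model show growing doubt (log R > 0).

    ln_evidences holds ln p(d|M), one entry per simulated data set.
    """
    # Doubt grows in a realization when chi2_{n-k}(alpha) < t.
    t = sorted(-2.0 * le - k * math.log(n) for le in ln_evidences)
    # Number of realizations that are allowed to show false doubt.
    allowed = min(math.floor(gamma * len(t)), len(t) - 1)
    # Placing the quantile at this order statistic leaves at most
    # `allowed` values of t strictly above it.
    threshold = t[len(t) - 1 - allowed]
    return chi2_cdf_wh(threshold, n - k)
```

The input `ln_evidences` would come from evaluating the known model's evidence on each simulated data set. Averaging the returned value over repeated simulations, as described above, reduces its scatter.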

Because the calibrated value of $\alpha$ decreases
monotonically with increasing $n$, the above calibration
procedure, once carried out for a certain number of
data points, is expected to be conservative when the
amount of data increases. This is shown in Fig. 4,
where we plot the fraction of realizations leading to
false doubt after $\alpha$ has been calibrated at $n = 100$.
We can see that for $n > 100$ the fraction of false
doubt remains below the calibrated level, and that
the residual $n$ dependency is fairly mild.

  n       $\gamma = 0.01$    $\gamma = 0.05$    $\gamma = 0.50$
  10      0.99969(3)       0.9980(1)        0.9550(8)
  100     0.9980(2)        0.9869(7)        0.747(4)
  200     0.9968(3)        0.9810(9)        0.684(4)
  1000    0.9942(6)        0.968(1)         0.586(5)

TABLE II: Values of $\alpha(\gamma, n, \mathcal{M}_1)$, as a function of the
number of data points, $n$, ensuring an average fraction of
false doubt of $\gamma = 1\%, 5\%, 50\%$ and for model distribution
$\mathcal{M}_1$. The number in brackets denotes the uncertainty in
the last digit.

FIG. 4: Fraction of cases of false doubt, $\log\mathcal{R} > 0$, as
a function of the number of data points, $n$, employing a
doubt calibration parameter $\alpha$ corresponding to a false
doubt probability of $\gamma = 5\%$ (dashed red) and $\gamma = 1\%$
(solid blue) for $n = 100$. For a number of data points
$n > 100$ there is a residual (if mild) $n$ dependence in the
fraction of false doubt, which however is always below
the calibration level. The calibration is independent of
the prior doubt.

C. Doubt when fitting data with an incorrect
model: model discovery

We now pretend that the true model where the
data come from, $\mathcal{M}_1$, is unknown to us. We take
$\mathcal{M}_0$ to be the only known model, and consequently
fit the data with it. We repeat the calibration
procedure for $\alpha$ for the known model $\mathcal{M}_0$. The
corresponding calibrated values of $\alpha(\gamma, n, \mathcal{M}_0)$ are given
in Table III.

  n     $\gamma = 0.01$    $\gamma = 0.05$    $\gamma = 0.50$
  2     0.9944(5)        0.973(1)         0.781(2)
  3     0.9938(6)        0.969(1)         0.692(3)
  4     0.9935(6)        0.967(2)         0.654(4)
  5     0.9933(7)        0.966(2)         0.633(4)
  6     0.9932(7)        0.965(2)         0.618(4)
  7     0.9931(7)        0.965(2)         0.607(4)
  8     0.9929(7)        0.964(2)         0.600(4)
  9     0.9928(7)        0.963(2)         0.593(4)
  10    0.9928(7)        0.963(2)         0.588(5)

TABLE III: Values of $\alpha(\gamma, n, \mathcal{M}_0)$, as a function of the
number of data points, $n$, ensuring an average fraction of
false doubt of $\gamma = 1\%, 5\%, 50\%$ and for the known model
$\mathcal{M}_0$. The number in brackets denotes the uncertainty in
the last digit.

Using the calibrated values of $\alpha$ we then compute



the posterior doubt on Mq. The result is shown in 
Fig. [5] (again averaged over 1000 realizations). We 
can see that doubt increases immediately from its 
prior value p{X) = 10~^, and tends very quickly to 
1. This clearly signals the inadequacy of the known 
model to fit the data. We should therefor question 
the correctness of the known model and suspect the 
existence of a better model. Therefore our procedure 
leads to model discovery in the absence of an explicit 
specification of the alternative, true model. 
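
To see why the posterior doubt saturates so quickly, it helps to write it out for a single known model. The following is a sketch, assuming (as the use of the prior doubt p(X) suggests) that doubt is the posterior probability of an unknown model X and that the single known model carries the remaining prior mass 1 - p(X). It is Bayes' theorem applied to that assumption, not a reproduction of the estimator used for p(d|X).

```latex
\mathcal{D} \equiv p(X|d)
  = \frac{p(d|X)\,p(X)}{p(d|X)\,p(X) + p(d|\mathcal{M}_0)\,[1 - p(X)]}
  = \left[ 1 + \frac{p(d|\mathcal{M}_0)}{p(d|X)}\,
               \frac{1 - p(X)}{p(X)} \right]^{-1}
```

Under this assumption the doubt approaches 1 once the evidence ratio p(d|M_0)/p(d|X) drops well below p(X)/[1 - p(X)], so even a small prior doubt is overcome when the known model fits badly enough.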



FIG. 5: Model discovery: posterior doubt on the (wrong)
model $\mathcal{M}_0$ as a function of the number of data points,
n, here for prior doubt p(X) = 10~^, using the value of
$\alpha$ calibrated to a fraction of false doubt $\gamma$ = 1% (solid
blue line), $\gamma$ = 5% (dashed red line), $\gamma$ = 50% (dotted
black line). The points give the mean doubt over 1000
data realizations and the vertical bars indicate the range
of values enclosing 95% of the realizations. The posterior
doubt goes very quickly to 1 (in fact, for n > 5, all
realizations have a posterior doubt of unity), therefore
leading us to doubt the correctness of the model.



D. Generalisation to multi-model cases and
discussion

In the example considered so far, we have only
dealt with doubt (or its absence) for one model at
a time. The situation is qualitatively similar when
several alternative known models are available (i.e.
for N > 1).

When several known models exist, the calibration
procedure should be carried out on the model that
is currently the best among them, i.e. on the model
with the largest Bayes factor. This ensures that the
probability of false doubt is under control for the
currently favoured model (which has the largest
evidence). If one of the known models is clearly
preferred, then the situation is qualitatively similar to
the case of N = 1 (since the evidence from the other,
poorer models contributes very little to the sum in
the definition of doubt). If instead several models
have a similar value of the evidence under present
data, then it is expected that the outcome of the
calibration should be quite similar for any of them.
Also, as discussed above, in the presence of M models
with approximately the same evidence there is an
extra "volume factor" M/N to take into account in
the rate of false doubt.

The false doubt calibration procedure introduced
here insures against unjustified doubt of the known
models at a given threshold (set by $\gamma$). Because
unknown models belong to the world of unknown
unknowns, it is more difficult to calibrate the
performance of doubt for model discovery, i.e. for
justifiably doubting false models. Whether or not the
doubt does increase when the true model is genuinely
unknown depends on how different that unknown
true model is from the known models. Here, "different"
must be interpreted in terms of a "distance" in
the space of models, as measured by the Bayesian
evidence. In this sense, the notion of doubt introduces
an absolute metric in model space, to complement
the relative metric represented by the Bayes factor.
In any case, if the true model is not very different in
its observational consequences from one of the known
models (in which case doubt will not increase), then
one might conclude that the known model is a
phenomenologically accurate description of the presently
available data. If doubt does increase, though, this
is a signal that the best available model is an
inadequate description of the observations and that new
theoretical input is required.

In general, we remark that the existence of known
models that are unnecessarily complex (i.e., with
more free parameters than the true model) is not
directly addressed by doubt. In this case, an analysis of
doubt (properly calibrated) will return a null result,
i.e. no reason for doubting the adequacy of the overly
complex model. One has to keep in mind that doubt
is a tool for model discovery, whose primary goal is
to point towards the need to enlarge (or change
completely) the space of the known models. The
shedding of unnecessary levels of complexity is instead a
task best accomplished by simultaneously analysing
the evidence and Bayesian complexity. (See [3] for
an application.)

Finally, the usual caveats apply about the
dependence on the volume in parameter space enclosed
by the parameters' prior $p(\theta_j|\mathcal{M}_j)$, as is always the
case for calculations involving the Bayesian evidence.
(See P, ITtI for a discussion.) However, the
calibration procedure for $\alpha$ automatically accounts for the
volume enclosed by the chosen prior under $\mathcal{M}_j$, as
compared with the reference prior employed for the
unknown model. If one were to change the prior on
the parameters of the known model, then this would
amount effectively to a change of model. (As
mentioned, we consider a model specification to consist of
both the model parameters and their prior.)
Therefore the calibration ought to be performed again on
the new model.
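
The remarks on several known models in this section can be checked numerically. The sketch below assumes, as an illustration only, that doubt is the posterior probability of an unknown model X and that the N known models share the remaining prior mass equally, p(M_i) = [1 - p(X)]/N. The log-evidence values and the prior doubt are hypothetical numbers, and the function name `posterior_doubt` is ours. With M of the N known models having comparable evidence, the sum over known models is approximately M/N times the single-model term, which is the volume factor mentioned in the text.

```python
import math


def posterior_doubt(log_ev_known, log_ev_unknown, prior_doubt):
    """Posterior probability of the unknown model X (assumed definition).

    log_ev_known   : list of ln p(d|M_i) for the N known models
    log_ev_unknown : ln p(d|X), however it has been estimated
    prior_doubt    : p(X); the known models share 1 - p(X) equally
    """
    n_known = len(log_ev_known)
    prior_known = (1.0 - prior_doubt) / n_known
    # Work relative to the largest log-evidence to avoid underflow.
    shift = max(max(log_ev_known), log_ev_unknown)
    known_sum = sum(math.exp(le - shift) for le in log_ev_known) * prior_known
    unknown = math.exp(log_ev_unknown - shift) * prior_doubt
    return unknown / (unknown + known_sum)


prior = 1e-2        # hypothetical prior doubt
log_ev_x = -100.0   # hypothetical ln p(d|X)

# N = 1: a single known model.
single = posterior_doubt([-103.0], log_ev_x, prior)
# N = 3 with one clearly preferred model: the poorer models add almost
# nothing to the sum, but the best model's prior is diluted by 1/N.
dominant = posterior_doubt([-103.0, -120.0, -125.0], log_ev_x, prior)
# N = 3 with M = 3 comparable models: volume factor M/N = 1.
comparable = posterior_doubt([-103.0, -103.0, -103.0], log_ev_x, prior)

print(f"N=1: {single:.4f}  N=3 dominant: {dominant:.4f}  "
      f"N=3 comparable: {comparable:.4f}")
```

The function takes the unknown model's log-evidence as an input, so it is agnostic about how that quantity is estimated and calibrated.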



V. CONCLUSIONS 

Checking the appropriateness of a given set of
models to describe observations is not a usual task
in a Bayesian framework. We have suggested an
intuitive approach to doubt in a Bayesian context
that shares some philosophy with the frequentist
approach: after all, the estimator for iZmax is based on
a $\chi^2$/dof argument and, strictly speaking, Bayesians
should show little interest in the hypothetical
outcome of different realizations of reality. Nevertheless,
as we demonstrated with the example of a simple
linear model, the concept of doubt is more powerful
than traditional goodness-of-fit tests, provided
the parameter $\alpha$ controlling the rate of false doubt is
correctly calibrated.

As mentioned in the introduction, the concept of
doubt is ideally suited for applications in cosmology.
Huge data sets and multi-dimensional parameter
spaces do not lend themselves very well to visual
inspection. Computing the doubt $\mathcal{D}$ (a single number)
gives an indication of the trustworthiness of
the model(s) under consideration in a Bayesian context.
Applications range from questions about the
very early inflationary phase of the Universe (in
particular about the shape of the primordial power
spectrum generated during inflation) to the future
evolution of the Universe, which appears to be dominated
by dark energy (in particular whether the equation
of state during late-time acceleration is constant).
Given the need to calibrate the doubt, such work
requires a large amount of computational power, even
given recent advances in numerical techniques for the
evaluation of the evidence, and it will be addressed
in a future paper.